Abstract

This paper addresses the problem of automatic acquisition of lexical
knowledge for rapid construction of MT engines for use in multilingual
applications. We describe new techniques for large-scale construction of a
Chinese-English verb lexicon, and we evaluate the coverage and
effectiveness of the resulting lexicon for a structured MT approach that
is embedded in a cross-language information retrieval system. Leveraging
an existing Chinese conceptual database called HowNet and a large,
semantically rich English verb database, we use thematic-role information
to create links between Chinese concepts and English classes. We apply
the metrics of recall and precision to evaluate the coverage and
effectiveness of the linguistic resources. The results of this work
indicate that: (1) we are able to obtain reliable Chinese-English entries
both with and without pre-existing semantic links between the two
languages; (2) if we have pre-existing semantic links, we are able to
produce a more robust lexical resource by merging these with our
semantically rich English database; (3) in our comparisons with manual
lexicon creation, our automatic techniques achieved 62% precision,
compared to a much lower precision of 10% for arbitrary assignment of
semantic links.

Keywords: Machine translation, Cross-language information retrieval,
Embedded MT resources, Chinese-English lexicons, Thematic roles, Lexical
acquisition


1.  Introduction 


The growing quantity of online multilingual information, e.g., on the
Web, has created an urgent need for rapid construction of lexical
resources. Automatic and semi-automatic techniques for lexical
acquisition are more critical now than ever before, as it becomes
infeasible to produce adequate semantic representations on a large scale
by human labor alone. We describe a new linguistically based approach to
large-scale construction of a semantic lexicon for Chinese verbs. This
resource is used by a machine translation module whose sub-components are
embedded in a system for English-Chinese cross-language information
retrieval. We focus specifically on the problem of compensating for gaps
in the existing resources, and we apply the metrics of recall and
precision to evaluate the coverage and effectiveness of our lexicon.

We leverage three existing resources: (1) a classification of English
verbs based on EVCA (English Verb Classes and Alternations) (Levin,
1993), in a significantly enhanced form called the LCS Verb Database
(LVD) (Dorr, 2001); (2) a Chinese conceptual database called HowNet^1
(Dong, 1988c), (Dong, 1988b), (Dong, 1988a); and (3) a large
machine-readable Chinese-English dictionary called Optilex.^2 We use
thematic-role information (e.g., a mapping between the HowNet “Patient”
and the LVD-based “Th(eme)”) to create links between Chinese concepts and
English classes. Each Chinese-English link is additionally associated
with a sense from WordNet (Miller and Fellbaum, 1991), (Fellbaum, 1998),
thus producing a new Asian companion to the current (Euro)WordNet
initiative. Finally, we use the LVD-based semantic classes, our thematic
role mapping, and canonical English words to produce a large set of
Lexical Conceptual Structures used in the embedded machine-translation
components.

Until this year, HowNet contained no English translations. Thus, our
initial experiments used Optilex to produce candidate English
translations. In the latest version of HowNet (Dong, 2000), the English
translations are included; however, our work has provided the basis for
increasing recall (acquisition of thousands of correct Chinese-English
entries that do not currently exist in HowNet) and, moreover, it has
provided a link into the semantic classes underlying a large English
conceptual database. Since the new HowNet was released, we have been able
to carry out a more accurate evaluation of our Chinese-English links. In
particular, we use the English translations in HowNet to determine the
precision of our approach (the overall accuracy of the Chinese-English
links we have already produced automatically). Finally, given that our
initial work did not make use of the English translations in HowNet, we
expect those same techniques to be generally applicable to other
foreign-language semantic hierarchies where English translations are not
available. We predict this will occur more and more frequently, as online
(non-bilingual) linguistic resources continue to be made available in
multiple languages (see, for example, (Hovy, 1998)).
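As a rough illustration of the recall and precision metrics used in the evaluation, the following minimal Python sketch scores a set of automatically produced Chinese-English links against a reference set. The function name, the pinyin-style keys, and the toy data are assumptions made for illustration; they are not drawn from HowNet, Optilex, or the evaluation itself.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (chinese, english) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = predicted & gold
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): links produced automatically vs. reference links.
predicted_links = {("la", "pull"), ("la", "drag"), ("la", "sing"), ("mo", "touch")}
gold_links = {("la", "pull"), ("la", "drag"), ("mo", "touch"), ("mo", "feel")}

p, r = precision_recall(predicted_links, gold_links)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision here is the share of produced links that also appear in the reference set, and recall is the share of reference links that were produced.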

The lexicons resulting from our acquisition approach are used for
determining word senses in a machine translation (MT) module whose
sub-components are embedded in a cross-language information retrieval
(CLIR) system, allowing the user to access Chinese documents using
English as their query language. The importance of determining word
senses in embedded MT is clear when one considers the degree of
inaccuracy that might result from using a weak alternative, such as
access to a bilingual word list.

^1 Available at: http://www.keenage.com

^2 Optilex is a large (600k entries) machine-readable version of the CETA
Chinese-English dictionary, licensed from the MRM Corporation,
Kensington, MD.

The next section relates our work to that of other researchers who have
investigated the problem of mapping across semantic hierarchies for
construction of MT resources. Following this, we describe Lexical
Conceptual Structure (LCS), an interlingual representation used in our
cross-language applications, and we illustrate how this representation is
used as the basis of our embedded MT approach. After this, we describe
the structure of our existing lexical resources, HowNet and LVD, and we
show that mapping between these two resources results in a new resource:
a Chinese LCS lexicon used for our embedded MT approach. Finally, we
demonstrate that our automatic acquisition techniques provide a framework
for compensating for gaps in this new resource. We evaluate the coverage
and accuracy of the new Chinese LCS lexicon with respect to the
pre-existing LVD and HowNet resources. We conclude that: (1) we are able
to obtain reliable Chinese-English entries both with and without
pre-existing semantic links between the two languages; (2) if we have
pre-existing semantic links, we are able to produce a more robust lexical
resource by merging these with our semantically rich English database;
(3) in our comparisons with manual lexicon creation, our automatic
techniques achieved 62% precision, compared to a much lower precision of
10% for arbitrary assignment of semantic links.


2.  Mapping  Across  Semantic  Hierarchies 

Several researchers have investigated the problem of assigning
class-based senses to verbs (Dorr et al., 1997), (Palmer and Wu, 1995),
(Palmer and Rosenzweig, 1996), (Palmer et al., 1997) using a variety of
online resources including Longman’s Dictionary of Contemporary English
(LDOCE) (Procter, 1978), EVCA (Levin, 1993), and WordNet (Miller and
Fellbaum, 1991), (Fellbaum, 1998). Translation of English classes into
other languages has proven difficult (Jones et al., 1994), (Nomura et
al., 1994), (Saint-Dizier, 1996), but regularities between different
language classifications can be found in some online resources (Dang et
al., 1998), (Dorr and Jones, 1999), (Olsen et al., 1998).

Our work extends the techniques described by (Palmer and Wu, 1995),
which used a concept space to produce a hierarchical organization of
Chinese verbs. We adopt a technique that is similar in flavor to the
intersective-class approach of (Dang et al., 1998), with the following
extensions: (1) the construction of an entry for all EVCA verbs, plus
those in the enhanced LVD, rather than a small set of verbs (the break
class); (2) the provision of a thematic-role based filter for a more
refined version of verb-class assignments; (3) concept alignment across
two different language hierarchies (Chinese and English) rather than one;
(4) mappings between Chinese and English thematic roles; and (5) hooks
into WordNet 1.6 senses for both languages. We view this work as a
significant enhancement to current efforts in resource building for
multilingual applications such as the PropBank effort (Palmer et al.,
2001).

Our approach to mapping across semantic hierarchies involves extraction
of candidate translations from Optilex for each of the Chinese verbs
occurring in HowNet. We then create links between Chinese concepts and
English classes using thematic-role mappings between HowNet entries and
LVD entries. Each Chinese-English link is subsequently associated with a
sense from WordNet 1.6 (Miller and Fellbaum, 1991), (Fellbaum, 1998).
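The linking step described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the role table (apart from the Patient-to-theme correspondence mentioned in the Introduction), the class label "X.1", the toy LVD contents, and the function `link_concept` are all hypothetical, and the actual procedure is the one detailed in Section 4.3.

```python
# Hypothetical HowNet-to-LVD role mapping; only "patient" -> "th" is stated in the text.
ROLE_MAP = {"agent": "ag", "patient": "th", "location": "loc"}

# Toy LVD: class label -> (thematic grid, member verbs). "X.1" is a made-up label.
LVD = {
    "47.8.f": (("th", "loc"), {"touch"}),
    "X.1": (("ag", "th"), {"touch", "hit"}),
}


def link_concept(hownet_roles, english_candidates):
    """Link a Chinese concept to LVD classes.

    A class is kept when its thematic grid equals the mapped HowNet roles
    and it contains at least one candidate English translation.
    """
    mapped = tuple(ROLE_MAP[role] for role in hownet_roles)
    links = []
    for cls, (grid, verbs) in sorted(LVD.items()):
        shared = verbs & set(english_candidates)
        if grid == mapped and shared:
            links.append((cls, sorted(shared)))
    return links


# A concept with roles (patient, location) and dictionary candidates "touch", "feel".
print(link_concept(("patient", "location"), ["touch", "feel"]))
```

In this toy setting the thematic-role filter discards class "X.1" even though it contains the candidate "touch", because its grid does not match the roles of the Chinese concept.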


[Figure 1: diagram relating LVD Classes [Dorr, 2001], Optilex [MRM -
CETA], WordNet 1.6 [Miller and Fellbaum, 1991; Fellbaum, 1998], and
HowNet Concepts [Dong, 1988; Dong, 2000]; one link is labeled θ-roles.]

Figure 1. Relation Between Existing Resources and New Mappings


Figure 1 illustrates the relation between existing resources and the
mappings we produced. Solid lines represent pre-existing mappings; dotted
lines are ones resulting from the application of our techniques. The most
critical of these is the one labeled θ-roles (shorthand for “thematic
roles”), which associates LVD classes with HowNet concepts. The
remaining two dotted-line mappings are “transitive closure” byproducts of
the other mappings: once the thematic-role mapping associates LVD verbs
with HowNet verbs, each HowNet verb is associated with Optilex-based
English glosses (translations) and WordNet 1.6 senses. Section 4.3
provides more details about how the correspondences in Figure 1 are
derived. Next, we describe the use of our lexical representations as the
basis of the embedded MT modules used in our CLIR system.


3. Embedded MT: Lexical Knowledge and Translation Approach

One of the types of knowledge that must be captured for use in MT
applications is linguistic knowledge at the level of the lexicon, which
covers a wide range of information types, such as verbal
subcategorization for events (e.g., that a transitive verb such as “hit”
occurs with an object noun phrase), featural information (e.g., that the
direct object of a verb such as “frighten” is animate), thematic
information (e.g., that “John” is the agent in “John hit the ball”), and
lexical-semantic information (e.g., that spatial verbs such as “throw”
are conceptually distinct from verbs of possession such as “give”). By
modularizing the lexicon, we treat each information type separately, thus
allowing us to vary the degree of dependence on each level, so that we
can address the question of how much knowledge is necessary for the
success of the particular NLP application. We focus on lexical-semantic
knowledge for the remainder of this section.

3.1.  Lexical  Conceptual  Structure 

The most intricate component of lexical knowledge is the lexical-semantic
information, which is encoded in the form of Lexical Conceptual Structure
(LCS) as formulated by Dorr (Dorr, 1993), (Dorr, 1994) based on work by
Jackendoff (Jackendoff, 1983), (Jackendoff, 1990). This representation is
used in the NLP component of an implemented foreign language tutoring
system (Dorr et al., 1997) and an interlingual machine translation system
(Olsen et al., 1998).

The LCS approach views semantic representation as a subset of conceptual
structure, the language of mental representation, as in (Jackendoff,
1983), (Jackendoff, 1990). This approach includes types such as Event and
State, which are specialized into primitives such as GO, STAY, BE,
GO-EXT, and ORIENT. We add a manner component [Manner JOG+INGLY] to
distinguish among verbs, e.g., run, walk, and jog. The full
representation for John jogged to school is therefore the representation
below, roughly ‘John went to the school by jogging’.^3

^3 Note that this representation of the surface sentence does not include
the FROM component shown in Figure 8 since there is no from phrase in
this particular example.


output-paper.tex;  1/10/2001;  15:22;  p.5 




B.  Dorr,  et  al. 


[Event GO_Loc
  ([Thing JOHN],
   [Path TO_Loc
     ([Thing JOHN],
      [Position AT_Loc ([Thing JOHN], [Thing SCHOOL])])],
   [Manner JOG+INGLY])]
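To make the nesting of this representation concrete, here is a minimal Python sketch that builds the same structure as a small tree and prints it in bracketed form. The `LCS` class, its attribute names, and the `render` function are hypothetical conveniences for illustration; they are not the data structures of the actual system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LCS:
    type: str                      # e.g. "Event", "Path", "Thing"
    head: str                      # primitive or constant, e.g. "GO", "JOHN"
    sem_field: str = ""            # semantic field subscript, e.g. "Loc"
    args: Tuple["LCS", ...] = ()   # arguments and modifiers, in order


def render(node):
    """Return the bracketed notation for an LCS node and its children."""
    label = node.head + (f"_{node.sem_field}" if node.sem_field else "")
    if not node.args:
        return f"[{node.type} {label}]"
    inner = ", ".join(render(arg) for arg in node.args)
    return f"[{node.type} {label} ({inner})]"


john = LCS("Thing", "JOHN")
school = LCS("Thing", "SCHOOL")
position = LCS("Position", "AT", "Loc", (john, school))
path = LCS("Path", "TO", "Loc", (john, position))
manner = LCS("Manner", "JOG+INGLY")
event = LCS("Event", "GO", "Loc", (john, path, manner))

print(render(event))
```

The manner component is simply the last child of the Event node, which is what lets verbs such as run, walk, and jog share the same GO structure while differing only in that constant.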

Given that we have mapped the HowNet entries to LVD entries, we are able
to produce the LCSs for Chinese in the same way that we produce entries
for English. Figure 2 presents the result of building an LCS entry for
the Chinese verb glossed as to touch.^4

(DEFINE-WORD
  :DEF_WORD "touch"
  :CLASS "47.8.f"
  :THETA_ROLES "_th_loc"
  :WN_SENSE (00820743 01832678 00820504)
  :LANGUAGE ENGLISH
  :LCS
  (be loc (* thing 2) (at loc (thing 2) (* thing 11)))
  (touch+ingly 26))

Figure 2. Lexicon Entry for touch in LVD Class 47.8.f

The number 47.8.f is a subcase of a Levin-based class for “Verbs of
Contiguous Location.” The thematic grid _th_loc indicates that the theme
and the location are both obligatory and should be annotated as such in
the instantiated LCS. This list structure recursively associates logical
heads with their arguments and modifiers. The logical head is represented
as a primitive/field combination, e.g., BE_Loc is represented as
(be loc ...). The arguments for BE are (thing 2) and (at loc ...).

The variables in the representation map between LCS positions and their
corresponding thematic roles. In the LCS framework, thematic roles
provide semantic information about properties of the argument and
modifier structures. In the example above, the numbers 2, 11, and 26
correspond to the roles theme (th), location (loc), and manner (manner),
respectively. These numbers enter into the construction of LCS entries:
they correspond to argument positions in the LCS template (extracted
using the class/grid/verb specification). Information is filled into the
LCS template using these numbers, coupled with the thematic grid tag for
the particular word being defined.
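The correspondence between the thematic grid and the numbered positions can be sketched in a few lines of Python. This is a simplified illustration, assuming that starred positions such as (* thing 2) are the ones filled by arguments and that the grid lists only obligatory roles separated by underscores; the helper names are hypothetical.

```python
import re

# Variable numbers and role tags taken from the touch example; other numbers are not covered.
VAR_TO_ROLE = {2: "th", 11: "loc", 26: "manner"}


def parse_grid(grid):
    """Split a thematic grid such as "_th_loc" into its role tags."""
    return [role for role in grid.split("_") if role]


def starred_vars(lcs_text):
    """Return the variable numbers of starred positions, e.g. "(* thing 2)" -> 2."""
    return [int(n) for n in re.findall(r"\(\*\s+\w+\s+(\d+)\)", lcs_text)]


LCS_TEMPLATE = "(be loc (* thing 2) (at loc (thing 2) (* thing 11))) (touch+ingly 26)"

grid_roles = parse_grid("_th_loc")
template_roles = [VAR_TO_ROLE[v] for v in starred_vars(LCS_TEMPLATE)]
print(grid_roles, template_roles)
```

In this example the roles named in the grid line up one-to-one with the starred positions of the template, which is the property the entry construction relies on.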

^4 The English translation is not used during the MT process; we store it
in lexical entries for convenience only (i.e., readability of the lexicon
by English speakers).

3.2. LCS-Based Machine Translation


Figure  3.  Translation  of  a  Chinese  sentence  into  English 


Our approach to machine translation is interlingual, where the
target-language lexicon is searched for appropriate lexical items
matching an LCS representation. A screen snapshot of a translation by our
system on a Chinese example is shown in Figure 3. This translation is
more fluent than its literal (gisted) equivalent: Our
Foreign_Economic_Trade_Ministry spokesperson lodge stern protest. We
describe the analysis and generation modules of this system; it is the
interlingua and generation modules that we have embedded in our CLIR
system (to be described in Section 3.3).

Analysis in our MT system relies on an in-house parser called REAP to
produce English parse trees on a large scale, and Chinese on a smaller
scale. One of the benefits of this parser is its ease of portability to
new languages (Weinberg et al., 1995). The parser output is semantically
analyzed, producing an LCS representation, which serves as the
interlingua.

Generation from the LCS is achieved by means of a system called Oxygen
(Habash, 2000), a variant of Nitrogen (Langkilde and Knight, 1998a),
(Langkilde and Knight, 1998b), (Langkilde and Knight, 1998c) that uses
our own linearizer implemented in Lisp with Nitrogen’s statistical
extraction module and Nitrogen’s morphological generation engine. The
English output is produced in two steps: lexical selection and syntactic
realization. Lexical selection involves a comparison between LCS
components and abstract LCS frames associated with words in an English
lexicon. Syntactic realization re-casts LCS-based thematic roles as
relations in an unordered tree where the root is an event concept and
each child is linked by a relation. Generation of target-language
sentences from LCS is described in detail in (Dorr et al., 1998).

Thematic-role information and its use in generation of natural language
translations are described in (Dorr et al., 1998) as a component of a
precursor to the Oxygen system. Specifically, thematic roles facilitate
the selection of appropriate target-language words. For example, the
Chinese verb 拉 (la) corresponds to a wide range of English translations,
even if we examine only the verb translations: slash, cut, chat, pull,
drag, transport, move, raise, help, implicate, involve, defecate,
pressgang.^5 Our approach provides a framework for disambiguation of such
cases. Certain of these possibilities (transport and move) are analyzed
as one semantic representation corresponding to thematic roles
(agent,theme,goal,source). Other possibilities (help) are analyzed as a
different semantic representation corresponding to thematic roles
(agent,theme,mod-poss).
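The role-based disambiguation just described can be sketched as a simple filter. The three entries below use only the role lists given in the text; the subset test used to decide compatibility and the function name `candidates` are assumptions made for this illustration.

```python
# Role structures for three translations of "la", as listed in the text.
SENSES = {
    "transport": ("agent", "theme", "goal", "source"),
    "move": ("agent", "theme", "goal", "source"),
    "help": ("agent", "theme", "mod-poss"),
}


def candidates(observed_roles):
    """Keep the translations whose role structure contains every observed role."""
    observed = set(observed_roles)
    return sorted(word for word, grid in SENSES.items() if observed <= set(grid))


print(candidates(["agent", "theme", "goal"]))      # motion readings only
print(candidates(["agent", "theme", "mod-poss"]))  # the "help" reading
print(candidates(["agent", "theme"]))              # still ambiguous
```

When only an agent and a theme are observed, all three readings survive, which mirrors the point that the lexicon must allow for multiple possibilities.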

3.3.  Embedded  MT:  Cross-Language  Information  Retrieval 

A common approach to transforming documents and queries in different
languages into a common indexing space for CLIR is to translate either
the documents or the queries into a single language (Oard and Dorr,
1996). Due to the time and computational expense of translation, query
translation is often preferred over document translation, although
document translation often produces superior results. A variety of
methods have been applied to translation term selection to cope with the
problem of translation ambiguity, where one source language term
translates into more than one target language alternative (Oard, 1998),
(Ballesteros and Croft, 1997), (Hull and Grefenstette, 1996). These
techniques include selecting every translation, the first N translations
according to some ranking strategy, and those that co-occur with
candidate translations of other terms in the query.
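The three term-selection strategies named above can be contrasted with a short Python sketch. The function names, the co-occurrence table, and the toy candidates are all invented for illustration and do not reproduce any of the cited systems.

```python
def select_every(candidates):
    """Strategy 1: keep every translation."""
    return list(candidates)


def select_first_n(ranked_candidates, n):
    """Strategy 2: keep the first N translations of an already ranked list."""
    return list(ranked_candidates)[:n]


def select_cooccurring(candidates, other_term_candidates, cooccurring_pairs):
    """Strategy 3: keep translations that co-occur with a candidate of another query term."""
    return [c for c in candidates
            if any(frozenset((c, o)) in cooccurring_pairs for o in other_term_candidates)]


# Toy data (hypothetical): candidates for one query term and for a second term.
verb_candidates = ["pull", "drag", "help", "chat"]
noun_candidates = ["cart", "wagon"]
pairs = {frozenset(("pull", "cart")), frozenset(("drag", "wagon"))}

print(select_every(verb_candidates))
print(select_first_n(verb_candidates, 2))
print(select_cooccurring(verb_candidates, noun_candidates, pairs))
```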


^5 The ambiguity in the word 拉 (la) can often be resolved if it is
combined with other characters. For example, 拉车 (la che) unambiguously
means pull a cart. However, since object dropping is a frequent
phenomenon in Chinese, it is not uncommon for verbs like ‘la’ to appear
without an argument that easily disambiguates the word. Thus, our
approach must allow for multiple possibilities in the lexicon.

The technique we adopt, LCS Query Translation (LQT), uses the same
representation that serves as the interlingua in our MT system. LQT
employs the LCS MT approach to transform the user’s query into the
document language for information retrieval. In our current system, we
use a structured syntax interface, called MADLIBS (Maryland Action
Detection / Language-Independent Browsing and Search), to ensure that the
user’s query is fully analyzable for application of LQT. Specifically,
for each word in the LCS lexicon we produce a simple “composed” LCS for
each thematic role structure associated with the word, instantiating each
role position with a dummy lexical entry, e.g., “someone-1” or
“something-2”. We then convert this version of the LCS into a template
for user input, by generating a syntactically correct surface sentence
realization using the Oxygen module of the MT system described above. We
now have a mapping from surface forms to the interlingual structures used
in our LQT approach.
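The idea of instantiating role positions with dummy entries and turning them into an input template can be illustrated with the Python sketch below. It is a deliberately naive stand-in: the filler table, the bracketed slot notation, and the fixed agent-verb-rest word order are assumptions, whereas the actual system obtains the surface realization from the Oxygen generator.

```python
# Hypothetical dummy fillers per thematic role.
DUMMY_FILLERS = {"agent": "someone", "theme": "something", "goal": "something"}


def dummy_entries(roles):
    """Number the dummy lexical entries by role position, e.g. "someone-1"."""
    return [f"{DUMMY_FILLERS[role]}-{i}" for i, role in enumerate(roles, start=1)]


def surface_template(verb_past, roles):
    """Naive realization: first slot, verb, remaining slots (not the Oxygen output)."""
    slots = [f"[{entry}]" for entry in dummy_entries(roles)]
    return " ".join([slots[0], verb_past] + slots[1:])


print(dummy_entries(["agent", "theme"]))
print(surface_template("provided", ["agent", "theme"]))
```

Each bracketed slot marks a position where the interface would accept free-form user input, while the slot number ties the input back to a role position in the composed LCS.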

The user interface for this system is illustrated in Figure 4. The
positions in the sentence realization that correspond to the thematic
roles appear as boxes for free-form user input. The interface allows
querying of either English or Chinese documents; we will focus on the
cross-language variant in the remainder of this discussion.

The  LCS  representation  described  in  Section  3.1  is  an  important 
component  of  our  LQT  approach  to  IR  (see  Figure  5).  In  particular,  we 
have  embedded  a  structural  graph-matching  routine  that  is  used  both 
for  full-blown  MT  (where  target-language  sentences  are  generated)  as 
well  as  selection  of  terms  for  cross-language  retrieval  (where  bags  of 
words  are  produced  as  query  terms). 

To construct the query, the surface template retrieves its underlying LCS
structure, complete with the input words filling the thematic role
positions. This correspondence between surface form and thematic
structure performs an initial phase of sense disambiguation, identifying
the subset of possible senses with this argument structure. To perform
translation of the query, we apply structural matching of the query
against a database of LCS structures. Depending on the language choice,
we consult different databases and return words with corresponding LCS
structure. The structural matching also directly exploits thematic role
information associated with entries in the LCS database for the language
of choice.
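A much-simplified picture of the structural matching step is given in the Python sketch below. LCS structures are modeled as nested tuples, the per-language databases hold a handful of invented entries (the identifiers `zh_touch` and `zh_go` are placeholders, not real lexicon keys), and matching is plain equality of skeletons; the graph-matching routine of the actual system is richer than this.

```python
TOUCH = ("be", "loc", ("thing",), ("at", "loc", ("thing",), ("thing",)))
GO_TO = ("go", "loc", ("thing",), ("to", "loc", ("thing",)))

# Toy LCS databases, one per document language (hypothetical contents).
LCS_DB = {
    "english": [("touch", TOUCH), ("go", GO_TO)],
    "chinese": [("zh_touch", TOUCH), ("zh_go", GO_TO)],
}


def skeleton(lcs):
    """Drop the filler words from ("thing", filler) leaves, keeping only structure."""
    if lcs[0] == "thing":
        return ("thing",)
    return tuple(skeleton(x) if isinstance(x, tuple) else x for x in lcs)


def structural_match(query_lcs, language):
    """Return the words in the chosen database whose LCS matches the query skeleton."""
    wanted = skeleton(query_lcs)
    return [word for word, lcs in LCS_DB[language] if lcs == wanted]


# Query LCS with the user's input words filling the thematic role positions.
query = ("be", "loc", ("thing", "hand"),
         ("at", "loc", ("thing", "hand"), ("thing", "wall")))
print(structural_match(query, "chinese"))
```

Switching the `language` argument is the toy counterpart of consulting a different database depending on the language choice.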

[Figure 4: screenshot of the MADLIBS system interface, showing a verb
list with “provide” selected, the sentence templates “someone provided
something” and “someone provided something to the something”, the example
input “China provided Taiwan with quake aid”, a Submit button, and
pull-down items for language pair (English - English), match type (Exact
Match), and translation display (Systran).]

Figure 4. Structured Syntax Input Interface

The system permits two forms of matching: exact and relaxed, selected
with the pull-down item in the interface. Exact match compares both
structure and manner constants and relies only on the database selection.
Relaxed match performs a second phase of processing after the structural
match, employing the WordNet correspondences produced by the thematic
hierarchy. This method computes similarity between the original term and
the candidate translations returned by the structural match, building on
Resnik’s (Resnik, 1995) technique for computing taxonomic similarity. In
all cases, the top N scoring candidates are returned.
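As a reminder of how taxonomic similarity in the style of (Resnik, 1995) works, the following Python sketch scores two concepts by the information content, -log p(c), of their most informative common ancestor and then keeps the top N candidates. The tiny taxonomy and the probabilities are invented for illustration; in the system described here the taxonomy would be WordNet and the details of the scoring may differ.

```python
import math

# Toy taxonomy (child -> parent) and concept probabilities, both hypothetical.
PARENT = {"act": None, "move": "act", "help": "act", "pull": "move", "drag": "move"}
PROB = {"act": 1.0, "move": 0.5, "help": 0.25, "pull": 0.25, "drag": 0.125}


def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def resnik_similarity(c1, c2):
    """Information content of the most informative common subsumer."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(-math.log2(PROB[c]) for c in common)


def top_n(term, candidates, n):
    """Rank candidates by similarity to the original term and keep the best N."""
    ranked = sorted(candidates, key=lambda c: resnik_similarity(term, c), reverse=True)
    return ranked[:n]


print(resnik_similarity("pull", "drag"))
print(top_n("pull", ["help", "drag"], 1))
```

Concepts that share only the root receive a score of zero, so candidates from unrelated parts of the taxonomy fall to the bottom of the ranking.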

We currently perform no additional analysis of noun phrases entered in
the thematic role position, though a fuller treatment of nominalized
events is planned. We instead apply a basic matching technique, using a
lexicon built from an English-Chinese term list provided by the
Linguistic Data Consortium (LDC), augmented, for the Chinese document
case, with the result of inverting the Optilex lexicon for words with
single-word translations.^6 Again, we select the top N translation
alternatives.

The translation terms identified by structural match, taxonomic match,
and word-for-word translation form a bag of words that constitutes the
query to an information retrieval system. We use a version of the Inquery
3.1p1 information retrieval system from the University of Massachusetts,
modified for 2-byte encodings of Chinese characters.


^6 See http://www.ldc.upenn.edu for information on LDC.

[Figure 5: diagram with the components SL Sentence; Minipar /
Parameterized Parser [Lin, 1998]; LCS; and the two outputs Bag of Words
for IR and TL Translation.]

Figure 5. MT and CLIR Applications that use Multilingual Resources


[Figure 6: screenshot of the ranked result list. Each row offers three
views of the document: Glossed, Systran MT, and Chinese. Page links 1 to
11 and a Next link appear below the list.]

Rank  Date      Gisted title
 1    99/10/07  other day, Fuzhou resident
 2    99/09/27  Fujian Province
 3    99/09/27  on September 26th, Taiwan
 4    99/09/27  Taiwan once more occurs 7
 5    99/09/27  unceasing typhoon
 6    99/09/27  motherland mainland
 7    99/10/07  compatriot hisses
 8    99/10/04  Red Cross Society of China
 9    99/09/27  each kind of activity
10    99/07/07  eulogizes Taiwan

Figure 6. Selection Interface


Results are displayed interactively as well (see Figure 6), in the user’s
choice of source document language, Systran machine translation, or
“gist”, a word-for-word translation technique that provides multiple
ranked alternate translations (see Figure 7). Additional details about
the use of semantic representations in the CLIR system are given in (Dorr
and Katsova, 1998), (Levow et al., 2000).

[Figure 7: screenshot of the gisted presentation of the document titled
“Taiwan once more occurs 7” (date 99/09/27). Each source term is followed
by up to three ranked alternate translations in parentheses, e.g.,
(taiwan) (again, one more time, one more) (is, appears, happen) 7 (year,
level, class) (over, above, upwards) (earthquake, earthquakes,
cataclysm).]

Figure 7. Presentation Interface: “Gisted” Format


4.  Automatic  Construction  of  MT  Resources 


We now turn to the application of automatic techniques for construction
of linguistic resources for our embedded MT modules, specifically, the
Chinese LCS resources used in our LQT approach. As outlined in Section 2,
we leverage HowNet and LVD, using thematic-role information to create
links between Chinese concepts and English classes. We describe these two
resources in more detail and then present our techniques for mapping
between them automatically.

4.1.  HowNet  Conceptual  Database 

HowNet is an on-line conceptual common-sense knowledge base that
contains hierarchical information relating concepts to their associated
Chinese words. Our focus is on the verb hierarchy, which has the
structure shown in Table I.

The number labels given here are our own; we use these for indicating
the level of each concept in the HowNet database. Note that the highest
two concepts in the verb hierarchy are “static” (V.1) and “act” (V.2).
These correspond, respectively, to verbs such as become (under the
“static” node V.1.1.1) and start (under the “act” node V.2.1.1).
The levels go much deeper than these, with the lowest ones at 8 levels
deep, e.g., V.1.2.1.6.3.3.1.15 itch.
Table I. HowNet Verb Hierarchy

V.1      |static|          V.2      |act|               V.2.4    |AlterState|
V.1.1    |relation|        V.2.1    |ActGeneral|        V.2.4.1  |AlterPhysical|
V.1.1.1  |isa|             V.2.1.1  |start|             V.2.4.2  |AlterStateNormal|
V.1.1.2  |possession|      V.2.1.2  |do|                V.2.4.3  |AlterStateGood|
V.1.1.3  |comparison|      V.2.1.3  |DoNot|             V.2.4.4  |AlterQuantity|
V.1.1.4  |suit|            V.2.1.4  |Cease|             V.2.4.5  |AlterStateBad|
V.1.1.5  |inclusive|       V.2.1.5  |Wait|              V.2.4.6  |AlterMental|
V.1.1.6  |connective|      V.2.2    |ActSpecific|       V.2.5    |AlterAttribute|
V.1.1.7  |CauseResult|     V.2.2.1  |AlterGeneral|      V.2.5.1  |MakeHigher|
V.1.1.8  |TimeOrSpace|     V.2.2.2  |AlterSpecific|     V.2.5.2  |MakeLower|
V.1.1.9  |arithmetic|      V.2.3    |AlterRelation|     V.2.5.3  |AlterAppearance|
V.1.2    |state|           V.2.3.1  |AlterIsa|          V.2.5.4  |AlterMeasurement|
V.1.2.1  |StatePhysical|   V.2.3.2  |AlterPossession|   V.2.5.5  |AlterProperty|
V.1.2.2  |StateMental|     V.2.3.3  |AlterComparison|   V.2.6    |MakeAct|
                           V.2.3.4  |AlterFitness|      V.2.6.1  |CauseToDo|
                           V.2.3.5  |AlterInclusion|    V.2.6.2  |CauseNotToDo|
                           V.2.3.6  |AlterConnection|   V.2.6.3  |use|
                           V.2.3.7  |AlterCauseResult|
                           V.2.3.8  |AlterLocation|
                           V.2.3.9  |AlterTimePosition|

HowNet contains 815 verb concepts altogether.[7] Each HowNet concept
is associated with a thematic-role specification. For example, the
verb “cure” has the thematic roles (agent,patient,content,tool).
Consider the sentence The doctor cured the man of pneumonia with
antibiotics. The roles in the specification have the following bindings,
respectively, for this sentence: doctor, man, pneumonia, antibiotics.[8]

The thematic-role specifications are used for prioritizing candidate
HowNet-LVD associations, as will be described below.


[7] We have excluded the 106 HowNet concepts that are not associated with any
Chinese words; these are “higher level” conceptual nodes with no Chinese realization
(e.g., V.1 |static|).

[8] Thematic-role specifications and their use in generation of natural-language
translations are described further in (Dorr et al., 1998).
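
The positional binding between a thematic-role specification and the arguments of a sentence can be sketched as follows. This is only an illustration of the idea: the function name `bind_roles` and the tuple representation are assumptions made here, not part of HowNet.

```python
# Illustrative sketch: bind the roles of a HowNet thematic-role
# specification to the arguments of one sentence, position by position.
# The name bind_roles and the tuple encoding are hypothetical.
CURE_ROLES = ("agent", "patient", "content", "tool")


def bind_roles(roles, fillers):
    """Pair each role with the filler found at the same position."""
    if len(roles) != len(fillers):
        raise ValueError("roles and fillers must have the same length")
    return dict(zip(roles, fillers))


# "The doctor cured the man of pneumonia with antibiotics."
binding = bind_roles(CURE_ROLES, ("doctor", "man", "pneumonia", "antibiotics"))
print(binding)
```

A dictionary keyed by role name makes it easy to look up, for instance, which argument fills the tool role when roles are later compared across resources.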
4.2. LCS Verb Database

The first public release of the LCS Verb Database (LVD) is now available
for research purposes (Dorr, 2001).[9] The verbs in this resource
were borrowed initially from the publicly available online index of the
EVCA classification.[10] While Levin’s original EVCA work provides a
unique and extensive catalog of verb classes, it does not define the
underlying meaning components of each class. One of the main contributions
of our work is that it provides a relation between Levin’s classes
and meaning components as defined in the LCS representation (described
above in Section 3.1), thematic role information, and hand-tagged
WordNet synset numbers.

We  built  our  database  by  subdividing  the  classes  into  a  more  refined 
set  (Dorr  and  Olsen,  1996),  extending  this  set  to  include  26  new  classes 
(Dorr,  1997a),  constructing  an  LCS  representation  for  each  entry  (Dorr, 
1997b),  assigning  WordNet  senses  to  existing  verbs  (Dorr  and  Jones, 
1999),  and  adding  3000  WordNet-tagged  verbs  (Green  et  al.,  2001a), 
(Green  et  al.,  2001b).  Levin’s  original  database  contained  3024  verbs  in 
192  classes  numbering  between  9.1  and  57 — a  total  of  4186  verb  entries. 
The  augmented  database  contains  4432  verbs  in  492  classes  with  more 
specific  numbering  (e.g.,  “51.3.2.a.ii”)  and  additional  class  numbers  for 
new  classes  (between  000  and  026) — a  total  of  11000  verb  entries. 

Each  semantic  class  contains  a  set  of  verbs  that  are  related  by  “se¬ 
mantic  structure”  as  defined  in  the  LCS.  For  example,  all  the  verbs  in 
the  semantic  class  of  “Run”  verbs  have  the  same  semantic  structure 
but  vary  in  their  semantic  content  (for  example,  run,  jog,  walk,  zigzag, 
jump,  roll,  etc.).  The  lexicon  entry  for  the  English  verb  jog  in  the 
“Run”  class  is  shown  in  Figure  8.  This  entry  includes  the  root  form  of 
the  word,  its  semantic  class  and  WordNet  sense  (for  verbs),  its  thematic 
roles  (clustered  into  a  grid),  and  its  LCS  representation. 

Mapping  English  thematic  roles  to  their  Chinese  counterparts  is 
the  primary  aid  in  selecting  the  appropriate  verb  class(es)  in  the  LCS 
Database  for  each  concept  in  HowNet.  The  next  section  demonstrates 
that  it  is  possible  to  produce  a  lexicon  by  associating  709  Chinese 
HowNet  concepts  with  492  classes  from  the  LCS  Database,  with  a  clear 
concept-to-class  correspondence  in  a  large  majority  of  the  cases. 


[9] In the work of (Green et al., 2001a), (Green et al., 2001b), the LVD resource
was referred to as Levin+. We have since adopted the more precise name “LCS Verb
Database” since the resource is not a simple extension to EVCA, but an entirely
new database with rich semantic structure, thematic information, WordNet senses,
and a significantly more comprehensive set of verbs.

[10] The EVCA index may be found at:

ftp://linguistics.archive.umich.edu/linguistics/texts/indices/evca93.index
(DEFINE-WORD
  :DEF_WORD "jog"
  :CLASS "51.3.2.a.ii"
  :THETA_ROLES "_th,src(),goal()"
  :WN_SENSE (01315785 01297547)
  :LANGUAGE ENGLISH
  :LCS
  (event go loc (* thing 2)
    ((* path from 3) loc (thing 2) (position at loc (thing 2) (thing 4)))
    ((* path to 5) loc (thing 2) (position at loc (thing 2) (thing 4)))
    (manner jog+ingly 26)))

Figure 8. Lexicon Entry for jog in LVD Class 51.3.2.a.ii
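
For readers who want to manipulate such entries programmatically, the following minimal sketch stores the non-LCS fields of the entry and splits its thematic grid into roles. The class `LexEntry`, the function `parse_grid`, and the reading of the notation (a leading "_" for an obligatory role, a leading "," for an optional one, parentheses for an optional preposition) are assumptions made for illustration; they are not the database's own API.

```python
import re
from dataclasses import dataclass

# Assumed reading of the grid notation: "_" introduces an obligatory role,
# "," an optional one, and parentheses hold an optional preposition.
ROLE_PATTERN = re.compile(r"([_,])([A-Za-z-]+)(?:\(([^)]*)\))?")


def parse_grid(grid):
    """Split a grid string into (role, obligatory, preposition) triples."""
    return [(m.group(2), m.group(1) == "_", m.group(3))
            for m in ROLE_PATTERN.finditer(grid)]


@dataclass
class LexEntry:
    """Hypothetical container for the non-LCS fields of an LVD entry."""
    word: str
    lvd_class: str
    theta_roles: str
    wn_senses: tuple  # kept as strings so that leading zeros survive
    language: str = "ENGLISH"

    def roles(self):
        return parse_grid(self.theta_roles)


jog = LexEntry("jog", "51.3.2.a.ii", "_th,src(),goal()",
               ("01315785", "01297547"))
print(jog.roles())
```

The preposition slot is `None` when no parentheses are present and an empty string when the parentheses are empty, which keeps the two cases distinguishable.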


4.3.  Mapping  Between  Chinese  HowNet  and  English  LVD 

Our technique for mapping between Chinese HowNet concepts and
English LVD classes involves associating HowNet thematic roles with
those in the LCS Database. Each HowNet concept (and each LCS) is
paired with a list of thematic roles, which we call a thematic grid.
For example, the HowNet concept |Cure| is paired with the θ-role grid
(agent,patient,content,tool), as in The doctor(agent) cured the
man(patient) of pneumonia(content) using antibiotics(tool). The corresponding
grid in our LVD database is (ag,th,mod-poss(of)). These
roles are associated with the first three noun phrases in the sentence;
the fourth noun phrase would be considered a modifier and, thus, is
not in the LVD grid. Although the HowNet and LVD roles are not in a
one-to-one correspondence, they can still be used for a “closest match”
prioritization of candidate HowNet-LVD associations, as we will see
shortly.

4.3.1.  HowNet-LVD  Mapping  Tasks 

There are three top-level tasks involved in mapping Chinese HowNet
concepts to English LVD classes:

(1) Produce all possible English Optilex glosses (translations) for all
12342 Chinese verbs in HowNet and associate each of the resulting
41,324 Chinese-English pairs with one or more of the 709 HowNet
concepts. [“HowNet Concept + Word + Glosses” in Figure 9.]

Example: The multiply ambiguous Chinese verb la has several
different Optilex glosses (transport, pull, help, raise, cut, chat,
defecate, slash, drag, move, implicate, involve, pressgang) and
is associated with multiple HowNet concepts: |Transport|, |Pull|,
|Help|, |Include|, |Force|, |Talk|, |Excrete|, |Attract|, and |Recreation|.


output-paper.tex;  1/10/2001;  15:22;  p.l5 


16 


B.  Dorr,  et  al. 


[Figure 9 diagram not reproduced. Recoverable panel titles: Resources
(HowNet; LVD; LVD-WordNet Mapping) and Program Flow (HowNet Concept +
Word + Glosses; HowNet Concept & Roles / LVD Class; Ranking by LVD
Class & Thematic Mapping; Output).]

Figure 9. Resources and Processing Stages for Mapping Chinese HowNet and
English LVD, including linkages to WordNet Senses


(2) Associate each verb-to-concept candidate with one or more of the
492 LVD classes, forming an average of 2 thousand verb-to-class
entries per HowNet concept (on the order of 1 million verb-to-class
candidates, total).[11] [“HowNet Concept & Roles / LVD Class” in
Figure 9]

Example: The Chinese verb la is associated with 22 LVD
classes: Admire (31.2.b, implicate, involve); Amuse (31.1.b, cut,
move, transport); Braid (41.2.2, cut); Breathe (40.1.2, defecate);
Build (26.1.a, cut); Carry (11.4.i, carry, drag, pull); Chitchat (37.6.a,
chat); Crane (40.3.2, raise); Cut (21.1.a, cut, slash); Cut (21.1.d,
cut); Equip (13.4.2, help); Force (12.a.ii, pull); Get (13.5.1.a, pull);
Grow (26.2.a.ii, raise); Hurt (40.8.3, cut, pull); Meander (47.7.a,
cut); Play (009, pawn); Put (9.4.a, raise); Search (35.2.a, drag);
Send (11.1, convey, ship, smuggle, transport); Slide (11.2.b, move);
Split (23.2.b, cut, pull).


[11] The Chinese verbs are additionally associated (for free) with WordNet senses
from our previously tagged LVD verbs. More details are given in (Dorr et al., 2000).
(3) For each HowNet concept, partition the associated Chinese-English
pairs into groups whose English glosses correspond to LVD classes.
This requires three steps:

a. Order the candidate LVD classes so that the highest-ranking
classes are those that contain the highest number of English
verbs matching the Optilex glosses. [“Ranking by LVD Class”
in Figure 9]

b. In cases where a tie-breaker is needed, reorder the candidate
LVD classes according to the degree to which the thematic-role
specification in the HowNet concept matches that of the LVD
class. The matching procedure relies on correlations derived
from approximately 200 seed mappings. A subset of these
mappings is shown in Table II.[12] [“Ranking by Thematic
Mapping” in Figure 9]

c. For each Chinese-English entry associated with the HowNet
concept, assign the highest-ranking candidate LVD class.
[“Output” in Figure 9]

Example: Two of the HowNet concepts associated with the multiply
ambiguous Chinese verb la are |Help| and |Transport|. The
θ-role specification associated with |Help| is (agent,patient,scope)
(as in John helped him with his work). This specification most
closely matches that of the Equip LVD class (where la is translated
as help), which has the specification _ag_th,mod-poss(with);
thus, the |Help| HowNet concept is associated with the Equip
LVD class, and the mapping between the two is (agent->ag),
(patient->th), (scope->mod-poss).

On the other hand, the HowNet concept |Transport| is associated
with the thematic-role specification

(agent,patient,LocationIni,LocationFin)

(as in John transported the goods from Boston to New York (westward)).
This specification most closely matches that of the Send
LVD class (where la is translated as transport); thus, the
HowNet concept |Transport| is associated with the Send LVD class,
and the mapping between the two is (agent->ag), (patient->th),
(LocationIni->src), (LocationFin->goal).

The end result is that the English glosses associated with la
are filtered down to help in the Equip semantic class and transport
in the Send semantic class. The corresponding WordNet senses
are then assigned from the hand-tagged LVD database. These are
senses 1 and 3 in the case of help [indexed as 01737017 and 00056138
in Figure 9] and senses 1, 2, and 4 in the case of transport [indexed
as 01330495, 00994853, 01328437 in Figure 9]:

• help:

Sense 1: assist
Sense 3: aid

• transport:

Sense 1: transport
Sense 2: carry
Sense 4: send, ship

[12] The seed mappings were done by hand at a rate of approximately 50 mappings
per hour; these were verified by a native Chinese speaker in a half day.

The process of associating LVD classes with Chinese verbs relies
on a massive filtering of spurious class assignments. For example, the
|Establish| HowNet concept is ultimately associated with only two LVD
classes, 29.2.c (Characterize) and 26.4.a (Create), but it initially had
29 potential LVD class assignments. One example of an LVD class that
was ruled out is the Change of State class, 45.4.a, associated with the
Optilex translation colonize for the Chinese verb zhimin. Although
this is a perfectly valid LVD class assignment for the HowNet
concept |Colonize|, it is not appropriate for the |Establish| HowNet
concept. Because this class is ranked 8th for |Establish| (as opposed
to the 1st and 2nd place rankings for 29.2.c and 26.4.a, respectively), this
assignment is ruled out by our algorithm.
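
The ranking in steps 3.a through 3.c can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the role-match score (summing, for each HowNet role, its best seed count against any role in the class grid) is one plausible reading of “closest match”, the class identifiers "X.1" and "X.2", their member verbs, and the seed count for (scope, mod-poss) are invented for the example, and the remaining three seed counts are taken from the legible cells of Table II.

```python
def rank_classes(glosses, class_verbs, concept_roles, class_roles, seed):
    """Return candidate LVD classes, best first.

    Step 3.a: prefer classes containing more of the concept's glosses.
    Step 3.b: break ties with a thematic-role match score (assumed scoring).
    """
    def overlap(cls):
        return len(glosses & class_verbs[cls])

    def role_score(cls):
        # For each HowNet role, take its best seed count against any
        # role in the class grid, then sum over the HowNet roles.
        return sum(
            max((seed.get((h, l), 0) for l in class_roles[cls]), default=0)
            for h in concept_roles
        )

    candidates = [c for c in class_verbs if overlap(c) > 0]
    return sorted(candidates, key=lambda c: (-overlap(c), -role_score(c), c))


# Toy data: class "13.4.2" (Equip) comes from the text; "X.1" and "X.2"
# and all verb memberships are hypothetical.
glosses = {"help", "assist"}
class_verbs = {
    "13.4.2": {"help", "equip"},
    "X.1": {"assist", "serve"},
    "X.2": {"run"},
}
class_roles = {
    "13.4.2": ("ag", "th", "mod-poss"),
    "X.1": ("th", "goal"),
    "X.2": ("th",),
}
seed = {
    ("agent", "ag"): 278,
    ("agent", "th"): 77,
    ("patient", "th"): 122,
    ("scope", "mod-poss"): 5,  # invented count for illustration
}

ranked = rank_classes(glosses, class_verbs,
                      ("agent", "patient", "scope"), class_roles, seed)
best = ranked[0]  # step 3.c: assign the highest-ranking class
print(ranked)
```

Both surviving candidates match one gloss, so the ordering is decided entirely by the role score; classes with no matching gloss never enter the ranking.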


5.  Compensating  for  Resource  Deficiencies 

The techniques described in the previous section create a bridge between
entries in the Chinese HowNet conceptual hierarchy and the LVD
semantic classes. We now demonstrate how the HowNet thematic role
and LVD semantic class mappings are combined to produce a richer
lexical resource for multilingual applications, with a focus on the use
of event structure for word sense disambiguation.

Recall that our approach in Section 4.3 consists of three top-level
tasks. Application of this approach resulted in 8089 LVD-classified
Chinese entries (about 43% of the number of potential entries). The
histogram in Table III characterizes the number of LVD classes required
for coverage of 709 HowNet concepts. The majority of HowNet concepts
are covered by 1-4 LVD classes, although a small number of concepts
are represented by as many as 22.
Table II. Seed Table for Mapping HowNet Roles into LVD Roles

[Table body not recoverable from the extracted text. Rows are the HowNet
roles agent, beneficiary, cause, content, contrast, experiencer, isa,
location, manner, partner, partof, patient, possession, purpose, range,
relevant, result, scope, source, target, ContentProduct, LocationFin,
LocationIni, StateFin, and StateIni. Columns are LVD roles, of which
ag, th, exp, goal, info, prop, pred, purp, and ben remain legible.
Cells hold correlation counts. Cells still legible in the first four
columns: agent/ag 278, agent/th 77, agent/exp 82, content/th 81,
experiencer/ag 18, experiencer/th 82, experiencer/exp 88, patient/th 122,
possession/th 28, relevant/ag 15, target/exp 12, target/goal 27,
LocationFin/goal 81.]

The remaining 10441 entries are accounted for through the use of
techniques that allow us to compensate for resource deficiencies (e.g.,
the lack of Optilex translations for certain Chinese verbs). These
techniques allow us to produce a more complete alignment between HowNet
and LVD.

In order to induce this enhanced version of the algorithm, we built an
LVD-based canonical specification for each of the 709 HowNet concepts
so that we could compensate for certain types of resource deficiencies.
Table III. Distribution of HowNet Concepts by Number of Intersecting LVD
Classes

[Histogram not reproduced. Vertical axis: # of HowNet Concepts;
horizontal axis: # of Intersecting LVD Classes.]

# of LVD classes:       0    1    2    3    4    5    6    7    8    9   10   11
# of HowNet concepts:   4  111  147  123  104   64   49   31   22   18   11   10

# of LVD classes:      12   13   14   15   16   17   18   19   20   21   22
# of HowNet concepts:   5    4    2    0    1    1    0    1    0    0    1

The canonical specification consists of an LVD class coupled with its
associated prototype verb. These canonical specifications provide a mapping
between a HowNet concept and an LVD class/prototype-verb pair.

Each canonical specification was automatically generated according
to the highest ranking LVD class using steps 3.a and 3.b in Section 4.3.
All such specifications were hand-verified (at a rate of 80 per
hour for 709 classes). In most cases, the prototype verb names the
HowNet concept, e.g., transport for the |Transport| HowNet concept.
In other cases, where the HowNet concept is not an English word,
the prototype word is a realization of that concept, e.g., belittle for the
|PlayDown| HowNet concept. A sample of the canonical specifications
is given in Table IV.

We use these canonical specifications to compensate for gaps that
arise in our three online resources: (1) LVD, (2) Optilex, and (3) HowNet.
Table IV. Sample of Canonical Specifications for Filling
Resource Gaps

HowNet Concept     Canonical Specification
|Transport|        11.1 Send, transport
|BeNot|            22.2.a Amalgamate, oppose
|Help|             13.4.2 Equip, help
|Moisten|          45.4.a Change of State, facilitate
|Excrete|          40.1.2 Breathe, bleed
|Apologize|        32.2.a Long, apologize
|PlayDown|         33.b Judgment, belittle
|Naming|           29.3 Dub, name
|Choose|           29.2.c, choose
|Announce|         37.7.b Say, announce
|Mean|             37.7.a Say, signify
|Communicate|      37.9.c Advise, inform

5.1.  LVD  Gaps 


An LVD gap is detected when an Optilex verb gloss for a Chinese verb
does not occur in LVD. When this occurs, the canonical specification for
the Chinese verb is automatically used to assign the verb an appropriate
LVD class. For example, one Optilex gloss associated with the HowNet
concept |Establish| (for the verb chongjian) is reconstruct, which
does not occur in LVD. Our technique associates this Chinese verb
with the canonical specification “29.2.c Characterize, establish” and
the Chinese verb is then linked with the word sense associated with
establish.

An interesting byproduct of the handling of LVD gaps is that it
allows us to enhance our LVD resource (and, additionally, the original
EVCA index). For example, the verb reconstruct can now be added to
LVD Class 29.2.c, on a par with the previously classified LVD verb
establish.
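
The fallback for LVD gaps can be sketched as a dictionary lookup with a default. The two tables below are toy stand-ins populated only with facts stated in the text, and the names `LVD_INDEX`, `CANONICAL`, and `classify` are hypothetical.

```python
# Toy stand-ins for the real resources (names are hypothetical).
LVD_INDEX = {"establish": ["29.2.c"]}              # English verb -> LVD classes
CANONICAL = {"Establish": ("29.2.c", "establish")}  # concept -> (class, prototype)


def classify(concept, gloss):
    """Return (classes, linking_verb, used_fallback) for one gloss."""
    classes = LVD_INDEX.get(gloss)
    if classes:
        return classes, gloss, False
    # LVD gap: the gloss is unknown to LVD, so fall back on the
    # canonical specification of the HowNet concept.
    lvd_class, prototype = CANONICAL[concept]
    return [lvd_class], prototype, True


result = classify("Establish", "reconstruct")
print(result)

# Byproduct described above: a gloss that triggered the fallback can be
# added to the class it was assigned to.
if result[2]:
    LVD_INDEX.setdefault("reconstruct", []).extend(result[0])
```

After the last step, a second lookup of reconstruct no longer needs the fallback, which mirrors the way gap handling enriches the LVD resource.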




5.2.  Optilex  Gaps 

An Optilex gap occurs when a particular translation for a Chinese
verb is missing. For example, the verb baibu has only one Optilex
gloss: manipulate. However, the word is associated with two
HowNet concepts, |Decorate| and |Control|. This gloss is only appropriate
for the |Control| concept. The decorate meaning of baibu
is omitted in Optilex.

Such gaps are detected by means of two types of information: (1)
HowNet roles and LVD thematic grid; and (2) correlations between the
gloss in question and other HowNet concepts. In this particular
example, the thematic grid for manipulate in LVD is (ag,exp,instr),
which is ranked low (fifth out of 28) with respect to the roles
(agent,patient) associated with the HowNet concept |Decorate|. By
contrast, this same LVD class has a high ranking (2nd out of 22) with
respect to the HowNet |Control| concept due to a close match between
(ag,exp,instr) and the HowNet thematic roles
(agent,patient,ResultEvent). In addition, the correlation of the gloss
manipulate is much higher for HowNet’s |Control| concept than it is for
HowNet’s |Decorate| concept (4 occurrences compared to 0). From these
two types of information, we can conclude that the decorate sense of
baibu is missing from Optilex. As in the case with LVD gaps, our
technique associates the Chinese verb with the canonical specification
“9.8.b Fill, decorate” to compensate for this Optilex gap.

In addition to their usefulness in handling gaps in our lexical
resources, the canonical specifications proved useful for assigning LVD
classes to Chinese verbs whose Optilex gloss was not “parsable” by
our gloss extraction procedure. For example, the Chinese verb
aida has only a single Optilex translation: take a beating. This verb is
associated with the HowNet concept |Suffer|, which has as its canonical
specification “31.3.d Marvel, suffer.” Thus, our technique associates
this verb with this canonical specification.

A similar approach is used for unknown or misspelled words. For
example, the translation of shusong in Optilex is misspelled
as tranport. Because this verb is associated with HowNet’s |Transport|
concept, we associated this verb with the canonical specification “11.1
Send, transport.”

5.3.  HowNet  Gaps 

In some cases, the HowNet hierarchy incorrectly associates a Chinese
word with a particular concept. For example, HowNet incorrectly
associates the two Chinese verbs zhahua and xiuhua with
the |Decorate| concept. These two verbs are translated as embroider in
LVD class 26.1.b (Build), but their meaning is closer to sew flowers.
That is, the patient is incorporated into the verb, which means the
thematic grid _ag_th_goal(into),ben(for) does not match that of
the HowNet concept (agent,possession,source).

Discrepancies in HowNet are detected by means of LVD-class frequency
for a particular HowNet concept. Out of the 17 verbs associated
with HowNet’s |Decorate| concept, only two of them (the two
miscategorized Chinese verbs) are associated with an LVD class that is
not 9.9 or 9.8. As in the gap-recovery approaches described above,
our technique associates the miscategorized verbs with the canonical
specification “9.8.b Fill, decorate.”[13]
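
The frequency-based detection can be sketched as follows. The function `flag_outliers`, the 15% threshold, the placeholder verb names, and the split of the fifteen well-behaved verbs between classes 9.8.b and 9.9 are all assumptions made for illustration; only the totals (17 verbs, two of them in class 26.1.b) come from the text.

```python
from collections import Counter


def flag_outliers(verb_classes, min_share=0.15):
    """Flag verbs whose LVD class is rare among the verbs of one concept.

    min_share is an assumed cut-off, not a value given in the text.
    """
    counts = Counter(verb_classes.values())
    total = len(verb_classes)
    rare = {cls for cls, n in counts.items() if n / total < min_share}
    return sorted(v for v, cls in verb_classes.items() if cls in rare)


# 17 verbs under |Decorate|: 15 placeholders in classes 9.8.b / 9.9
# (hypothetical split) plus the two miscategorized verbs in 26.1.b.
verb_classes = {f"verb_{i}": "9.8.b" for i in range(10)}
verb_classes.update({f"verb_{i}": "9.9" for i in range(10, 15)})
verb_classes.update({"zhahua": "26.1.b", "xiuhua": "26.1.b"})

flagged = flag_outliers(verb_classes)
print(flagged)
```

The flagged verbs would then be routed to the canonical specification of the concept, in the same way as the other gap-recovery cases.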


6.  Results 

Using the gap compensation techniques described above, we have achieved
a more refined HowNet-to-LVD mapping, providing an increase in
LVD-classified Chinese words from the previous 8089 entries to the
current expanded set of 17284 LVD-classified Chinese words. This section
presents a description of our coverage, comparing HowNet to LVD, and
then provides a quantitative evaluation.

6.1.  Coverage  of  HowNet/LVD  Alignment 

The histogram in Table V characterizes the number of LVD classes
required for coverage of 709 HowNet concepts. We considered this initial
experiment to be a success for several reasons: (1) In 359 cases
(50% of the HowNet concepts), the partitioning corresponded to 3 or
fewer LVD classes; (2) Most HowNet concepts with 2 or more partitions
had a very heavy association with a single LVD class (60% or higher),
with most other partitions falling around 20% or lower; (3) Only 2
cases did not correspond to any LVD class (i.e., degenerate HowNet
concepts for which no correlations with LVD could be found); (4) There
were virtually no partitionings (a handful of single HowNet concepts)
exceeding 13 LVD classes.
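
The coverage figures quoted above can be recomputed from the Table V counts. The sketch below simply transcribes those counts into a list indexed by number of LVD classes (0 through 22); the variable names are our own.

```python
# Table V counts: index = number of intersecting LVD classes (0..22),
# value = number of HowNet concepts with that many classes.
TABLE_V = [2, 84, 143, 132, 116, 64, 56, 32, 24, 18, 9, 14,
           3, 6, 1, 1, 1, 0, 1, 1, 0, 0, 1]

total = sum(TABLE_V)               # all HowNet concepts
one_to_three = sum(TABLE_V[1:4])   # concepts covered by 1, 2, or 3 classes
no_class = TABLE_V[0]              # concepts with no LVD class
over_13 = sum(TABLE_V[14:])        # concepts needing more than 13 classes

print(total, one_to_three, round(one_to_three / total, 3), no_class, over_13)
```

Note that the figure of 359 corresponds to concepts with one to three classes; the two concepts with no class are counted separately under reason (3).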

Table V. Distribution of HowNet Concepts by Number of Intersecting LVD
Classes using Canonical Specifications

# of LVD classes:       0    1    2    3    4    5    6    7    8    9   10   11
# of HowNet concepts:   2   84  143  132  116   64   56   32   24   18    9   14

# of LVD classes:      12   13   14   15   16   17   18   19   20   21   22
# of HowNet concepts:   3    6    1    1    1    0    1    1    0    0    1

At the time of this initial experiment, the HowNet resource did
not include English translations. Although the translation resource we
used was the CETA/Optilex dictionary, our technique was developed to
accommodate any arbitrary translation resource for mapping between
HowNet concepts and LVD classes. However, the most recent release
of HowNet associates English glosses with each word in a class. To
assess the impact of additional translation resources, we performed a
three-way comparison, performing the HowNet-LVD mapping with: (1)
Optilex translations alone; (2) HowNet translations alone; and (3) a
merged resource including translations from both HowNet and Optilex.
We then computed precision and recall measures for HowNet-LVD mapping
for each of the individual resources relative to the merged resource
and to each other. We also assess the impact of using canonical grid
information to aid in the assignments. The results appear in Table VI
below.

[13] Ultimately, the miscategorized verbs should be disassociated from the HowNet
concept, but there is currently no way to tease apart such cases from the Optilex
gaps. Thus, the two are treated identically.

Table VI. Precision and Recall for HowNet-LVD Mappings (with and without
Canonical Grid Information)

                          Precision               Recall
Contrast              w/o Canon  w/ Canon    w/o Canon  w/ Canon
HowNet vs Optilex       0.61       0.65        0.46       0.55
Optilex vs HowNet       0.42       0.51        0.61       0.65
HowNet vs Merged        0.79       0.82        0.61       0.67
Optilex vs Merged       0.71       0.75        0.79       0.80

We find that the HowNet resource achieves higher precision, as
might be expected since the available translations are limited to those
the designer believed appropriate for each class. The Optilex resource
achieves higher recall by drawing from a wider variety of alternate
translations. The canonical grid information improves all measures and
smooths differences between the resources.
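
One common way to compute such measures is to treat each resource's output as a set of (HowNet concept, LVD class) assignments and compare one set against another. The sketch below follows that set-based definition as an assumption, since the exact procedure is not spelled out here; the example assignments are toy data assembled from class numbers mentioned in the text.

```python
def precision_recall(system, reference):
    """Set-based precision and recall of `system` relative to `reference`."""
    common = system & reference
    precision = len(common) / len(system) if system else 0.0
    recall = len(common) / len(reference) if reference else 0.0
    return precision, recall


# Toy (concept, LVD class) assignments; not actual experimental output.
hownet = {("Help", "13.4.2"), ("Transport", "11.1"), ("Establish", "29.2.c")}
optilex = {("Help", "13.4.2"), ("Transport", "11.1"),
           ("Establish", "26.4.a"), ("Establish", "45.4.a")}

p, r = precision_recall(hownet, optilex)
print(round(p, 2), round(r, 2))
```

Under this definition the precision of one resource against another equals the recall of the second against the first, so swapping the arguments swaps the two numbers.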

If we further examine the translations used to make these assignments,
we find 7653 Chinese-word/English-gloss pairs in common, 17609
pairs from HowNet, and 14252 from Optilex, from a total of 24205
assigning pairs. The results indicate that a merged translation resource,
drawing from both HowNet and LVD/Optilex, can produce a richer
and more robust mapping among the concept classes. For example,
the HowNet concept |WeatherChange| is associated with three verbs,
VM (rain), VS (snow), and ^1^ (fall all over the area). Whereas
the first two verbs have translation equivalents that link directly into
our thematic grids (and, hence, our WordNet senses), the third verb is
WordNet-linked solely by virtue of our thematic-grid matching routine.
This routine allows us to determine that the closest English equivalent
for ^1^ is precipitate, a verb that does not show up in the HowNet
hierarchy. Thus, our integration of LVD/Optilex with the HowNet
resource has provided a more comprehensive linking to thematic grids
and WordNet senses than would be available in either resource alone.

6.2.  Quantitative  Evaluation 

In addition to the qualitative descriptions of coverage and comparisons
of HowNet-LVD alignment described above, we implemented a
quantitative analysis of the effectiveness of our semi-automatic lexicon
creation process. We compare the automatic assignments of LVD
class and theta grid to Chinese words with manual assignments by two
Chinese language experts.

Table VII. Precision and Recall of our HowNet-LVD Alignment

Criterion             | Precision | Recall
LVD Grid - Auto       | 0.62      | 0.25
LVD Grid - Random     | 0.10      | 0.0049
LVD Grid - Most Freq  | 0.357     | 0.07
LVD Class - Auto      | 0.63      | 0.29
LVD Class - Random    | 0.035     | 0.02
LVD Class - Most Freq | 0.18      | 0.043

We compare the manually and automatically assigned LVD class
and theta grid labels for a set of 272 separately hand-tagged Chinese
verbs. For these verbs, the semi-automatic linkage technique proposes
577 assignments of class or grid, some of which are duplicates. Manual
assignment resulted in 1188 distinct theta grid labels and 1282
LVD class labels. We report precision and recall measures for these
two types of labels relative to manual assignment. We also contrast
the effectiveness of two plausible naive strategies as baselines: random
assignment of LVD class and theta grid labels, and assignment of the
most frequent label based on an English verb lexicon with more than
10000 entries.
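
The two naive baselines can be sketched as follows. This is a minimal
illustration in Python, not our evaluation code: we assume that labels
are scored as (verb, label) pairs, that the random baseline draws one
label uniformly from the label inventory, and that each baseline
proposes exactly one label per verb. The verbs, labels, and function
names are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Labels = Dict[str, Set[str]]  # verb -> set of assigned labels


def score(proposed: Labels, gold: Labels) -> Tuple[float, float]:
    """Precision and recall over (verb, label) pairs."""
    proposed_pairs = {(v, l) for v, ls in proposed.items() for l in ls}
    gold_pairs = {(v, l) for v, ls in gold.items() for l in ls}
    correct = proposed_pairs & gold_pairs
    precision = len(correct) / len(proposed_pairs) if proposed_pairs else 0.0
    recall = len(correct) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


def most_frequent_baseline(verbs: Iterable[str], lexicon_labels: List[str]) -> Labels:
    """Give every verb the single most frequent label of the English lexicon."""
    top_label = Counter(lexicon_labels).most_common(1)[0][0]
    return {v: {top_label} for v in verbs}


def random_baseline(verbs: Iterable[str], lexicon_labels: List[str], seed: int = 0) -> Labels:
    """Give every verb one label drawn uniformly from the label inventory."""
    rng = random.Random(seed)
    inventory = sorted(set(lexicon_labels))
    return {v: {rng.choice(inventory)} for v in verbs}


# Toy data: one grid label per English lexicon entry, and manual labels.
lexicon_labels = ["_ag_th", "_ag_th", "_ag_th", "_th", "_ag_th_goal"]
gold = {"v1": {"_ag_th", "_th"}, "v2": {"_ag_th"},
        "v3": {"_ag_th_goal"}, "v4": {"_th"}}

print(score(most_frequent_baseline(gold, lexicon_labels), gold))
print(score(random_baseline(gold, lexicon_labels), gold))
```

Because each baseline proposes a single label per verb while the manual
annotation allows several, recall is bounded well below precision in
this sketch, as it is for the baseline rows of Table VII.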

Our criteria for agreement between manual and automatic assignments
are as follows:

—  LVD  classes  are  said  to  agree  if  the  major  and  minor  LVD  classes 
match exactly, e.g. 40.7.ii.a matches 40.7.i.

—  Theta  grids  agree  if  the  same  roles  appear  in  the  same  order, 
without regard to obligatory versus optional distinctions, e.g.
_ag,th  matches  _ag_th. 

These results appear in Table VII.
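
The two agreement criteria above can be sketched in Python as follows.
This is an illustration under stated assumptions: we assume that the
major and minor class are the first two dot-separated fields of an LVD
class label, and that each role in a grid string is introduced by a
leading "_" or "," that encodes the obligatory/optional distinction.
Grids with more elaborate role annotations are not handled, and the
function names are ours for this example only.

```python
import re
from typing import List, Tuple


def lvd_major_minor(label: str) -> Tuple[str, ...]:
    """Keep the major and minor class, e.g. '40.7.ii.a' -> ('40', '7')."""
    return tuple(label.split(".")[:2])


def classes_agree(a: str, b: str) -> bool:
    """LVD classes agree if their major and minor classes match exactly."""
    return lvd_major_minor(a) == lvd_major_minor(b)


# Assumption: a role is a run of characters following a '_' or ',' marker.
ROLE = re.compile(r"[_,]([^_,]+)")


def grid_roles(grid: str) -> List[str]:
    """Return the roles in order, dropping the marker before each role."""
    return ROLE.findall(grid)


def grids_agree(a: str, b: str) -> bool:
    """Theta grids agree if the same roles appear in the same order."""
    return grid_roles(a) == grid_roles(b)


print(classes_agree("40.7.ii.a", "40.7.i"))  # the example given above
print(grids_agree("_ag,th", "_ag_th"))       # the example given above
```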

We achieve precision of approximately 0.62 for both LVD class and
theta grid assignment; recall levels are lower, at 0.25 for theta grids
and 0.29 for LVD classes.
As the table illustrates, the automatic technique we have developed
substantially outperforms either random or most frequent LVD class
assignment and random theta grid assignment. While still large, the
contrast with most frequent thematic grid assignment is less dramatic.
The relatively good performance of most frequent theta grid assignment
is easily explained by the fact that 28% of the verbs can appear as the
basic transitive, the most common thematic grid. Thus, the transitive
grid assignment produces precision of 0.357.

The relatively lower numbers for recall are best understood in terms
of two features. First, this technique is more focused on high precision
than on recall. Second, the “majority rules” strategy for selection
among alternative likely assignments will tend to prefer more common
class assignments, even when less frequent acceptable class assignments
are available.
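
The effect of the “majority rules” strategy on recall can be seen in a
small sketch. We assume here, for illustration only, that each candidate
translation contributes one vote for an LVD class and that only the
class or classes with the most votes are kept; the votes and class
labels are invented.

```python
from collections import Counter
from typing import Iterable, List


def majority_classes(votes: Iterable[str]) -> List[str]:
    """Return the class or classes that receive the most votes."""
    counts = Counter(votes)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(c for c, n in counts.items() if n == top)


# Hypothetical votes for one Chinese verb, one per candidate translation.
votes = ["11.1", "11.1", "11.1", "51.2", "11.5"]
selected = majority_classes(votes)

# Hypothetical manual labels: the less frequent class is also acceptable.
manual = {"11.1", "51.2"}
missed = manual - set(selected)
print(selected, missed)
```

The selected class is correct, so precision is unaffected, but the
acceptable minority class is dropped and counts against recall.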


7.  Summary  and  Future  Work 

We have presented an approach to aligning two large-scale online
resources, HowNet and LVD. The lexicon resulting from this approach
is large-scale, containing 18530 Chinese entries. The technique for
producing these links involves matching thematic grids in HowNet with
those in LVD. Our results indicate that the correspondence is very high
between the 709 Chinese HowNet concepts and the 492 LVD classes.
We see our techniques as the first step toward a general approach to
building repositories for interlingual-based NLP applications.

Our work has shown that it is possible to combine different types
of knowledge from existing resources in ways that improve upon the
coverage and robustness of each of these independent resources. One
area of investigation that has allowed us to enrich the existing resources
is the development and application of gap compensation techniques,
allowing us to fill in possible Chinese-English links where none existed
previously.

We are currently using the lexicon for word-sense disambiguation
in machine translation and cross-language information retrieval. As we
saw above, the Chinese verb (la) has several possible translations, but
not all of these will be appropriate in every context. If we can determine
which HowNet concept corresponds to (la), then we can translate it
appropriately. For example, if the HowNet concept is |Transport|, the
translation would be ship or transport, but not slash, chat, implicate,
etc. We can detect which HowNet concept is appropriate by examining
the other words in the sentence. If those words co-occur with other
Chinese verbs associated with a particular HowNet concept (as determined
through a corpus analysis), then it is likely that that HowNet concept
is the appropriate one for the Chinese verb. That is, if we find other
verbs from a given HowNet concept occurring in the same context, then
we can hypothesize that this particular verb has the meaning of this
HowNet concept.
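
A minimal sketch of this disambiguation idea follows. It is an
assumption-laden illustration, not a description of a finished system:
we assume a tokenized corpus, score each candidate concept by how often
the context words of the ambiguous verb co-occur with the other verbs of
that concept, and use English placeholder tokens in place of Chinese
words. The concept |Chat| and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def concept_profiles(corpus: List[List[str]],
                     concept_verbs: Dict[str, Set[str]]) -> Dict[str, Counter]:
    """Count the context words that co-occur with the verbs of each concept."""
    profiles = {concept: Counter() for concept in concept_verbs}
    for sentence in corpus:
        words = set(sentence)
        for concept, verbs in concept_verbs.items():
            if words & verbs:
                profiles[concept].update(words - verbs)
    return profiles


def pick_concept(context: List[str], candidates: List[str],
                 profiles: Dict[str, Counter]) -> str:
    """Pick the candidate whose profile best matches the context words.

    Ties go to the candidate listed first.
    """
    return max(candidates, key=lambda c: sum(profiles[c][w] for w in context))


# Other verbs belonging to each candidate concept (placeholders).
concept_verbs = {"Transport": {"ship", "haul"}, "Chat": {"gossip", "converse"}}
corpus = [["truck", "haul", "cargo"], ["ship", "cargo", "port"],
          ["neighbors", "gossip", "tea"], ["friends", "converse", "tea"]]
profiles = concept_profiles(corpus, concept_verbs)

# A sentence containing the ambiguous verb, written here as "la".
print(pick_concept(["la", "cargo", "port"], ["Chat", "Transport"], profiles))
```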

The algorithm for mapping between HowNet concepts and LVD
classes requires a “training” step, i.e., the seed mappings given earlier.
However, it is possible to produce a ranked mapping between thematic
grids by counting correspondences between LVD-based roles and the
HowNet-based roles across the entire concept space. This approach is
also currently under investigation.
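
One way such counting could work is sketched below. Since this approach
is still under investigation, the sketch is only a guess at its simplest
form: we assume that each linked word contributes its HowNet role list
and its LVD role list, count every co-occurring role pair, and rank the
LVD roles for each HowNet role by count. The role names and data are
illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

RolePair = Tuple[List[str], List[str]]  # (HowNet roles, LVD roles)


def rank_role_mappings(pairs: Iterable[RolePair]) -> Dict[str, List[Tuple[str, int]]]:
    """For each HowNet role, rank LVD roles by co-occurrence count."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for hownet_roles, lvd_roles in pairs:
        for h in hownet_roles:
            for l in lvd_roles:
                counts[h][l] += 1
    return {h: c.most_common() for h, c in counts.items()}


# Hypothetical role lists for four linked words.
pairs = [
    (["agent", "patient"], ["ag", "th"]),
    (["agent", "patient"], ["ag", "th"]),
    (["agent", "target"], ["ag", "goal"]),
    (["patient"], ["th"]),
]
ranked = rank_role_mappings(pairs)
print(ranked["agent"])
print(ranked["patient"])
```

Raw counts favor frequent roles, so a practical version would likely
need some normalization; we leave that open here.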

Another area of investigation is the use of a WordNet-based distance
metric (e.g., the information-content approach of (Resnik, 1995)) for
additional pruning power in the HowNet-to-LVD alignment. Because
each of the entries in the LVD classification is associated with a
WordNet sense, it is possible to rule out certain class assignments for a
given HowNet concept by examining semantic distance between the Optilex
glosses for a particular Chinese word and the glosses for other words
associated with that concept.
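
The information-content measure can be illustrated on a toy taxonomy.
In the sketch below, similarity between two senses is the information
content, -log p(c), of their most informative common subsumer. The
hierarchy and probabilities are invented stand-ins for WordNet and
corpus counts, and the pruning threshold is a hypothetical parameter.

```python
import math
from typing import Dict, Optional, Set

# Toy is-a hierarchy (child -> parent) standing in for WordNet.
PARENT: Dict[str, Optional[str]] = {
    "entity": None, "motion": "entity", "communication": "entity",
    "transport": "motion", "ship": "transport", "haul": "transport",
    "chat": "communication",
}
# Invented probabilities p(c) of encountering an instance of concept c.
PROB = {"entity": 1.0, "motion": 0.5, "communication": 0.5,
        "transport": 0.25, "ship": 0.125, "haul": 0.125, "chat": 0.5}


def subsumers(concept: str) -> Set[str]:
    """Return the concept together with all of its ancestors."""
    found = set()
    node: Optional[str] = concept
    while node is not None:
        found.add(node)
        node = PARENT[node]
    return found


def resnik_similarity(c1: str, c2: str) -> float:
    """Information content of the most informative common subsumer."""
    common = subsumers(c1) & subsumers(c2)
    return max(-math.log(PROB[c]) for c in common)


def keep_assignment(gloss: str, concept_glosses: Set[str], threshold: float) -> bool:
    """Keep a gloss only if it is close enough to some gloss of the concept."""
    return any(resnik_similarity(gloss, g) >= threshold for g in concept_glosses)


print(resnik_similarity("ship", "haul"))
print(keep_assignment("chat", {"ship", "haul"}, threshold=1.0))
```

Glosses that share only the root of the hierarchy receive a similarity
of zero and would be pruned under any positive threshold.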